Abstract —  Two-dimensional  (2-D)  processing  of  speech  has 
recently  been  explored  as  an  alternative  representational 
framework  that  explicitly  analyzes  temporal,  spectral,  and  joint 
spectrotemporal  energy  fluctuations  or  “modulations”  present  in 
time-frequency  distributions  (e.g.,  in  the  spectrogram  or  auditory 
spectrogram).  This  paper  considers  2-D  Fourier  analysis  of  local 
time-frequency  regions  of  wideband  spectrograms,  a 
representation  referred  to  as  the  (wideband)  Grating 
Compression  Transform  (WGCT).  We  develop  frequency- 
dependent  models  of  speech  signals  in  the  WGCT  context  related 
to  speech  production  characteristics,  building  on  previous  work 
in  modeling  narrowband-based  GCT  representations.  Model 
evaluation through simulations and error analysis is performed.
Comparison demonstrates the model's effectiveness and reveals
important distinctions, including “dual” behavior, between the
wideband and narrowband models. Our results motivate a novel taxonomy of
speech  signal  behavior  for  use  as  an  interpretative  framework 
(i.e.,  in  relation  to  speech  production  characteristics)  for  2-D 
processing  of  speech  using  the  GCT  and  potentially  other  2-D 
approaches and time-frequency distributions. We demonstrate
the ability of the model to represent real speech content by
using demodulation techniques for analysis/synthesis of
wideband spectrograms and for co-channel speaker separation using
prior pitch information.

Index  Terms — 2-D  processing  of  speech,  Grating  Compression 
Transform,  wideband  spectrogram,  spectrogram  reconstruction, 
co-channel  speaker  separation 

I.  INTRODUCTION 

Two-dimensional  (2-D)  processing  of  speech  has  recently 
been  explored  as  an  alternative  representational  approach 
that  explicitly  analyzes  temporal,  spectral,  and  joint 
spectrotemporal  energy  fluctuations  or  “modulations”  present 
in  time-frequency  distributions  (e.g.,  in  the 
spectrogram/auditory  spectrogram).  Examples  of  this  include 
auditory  models  [1][2],  the  modulation  spectrogram  [3],  and 
our  previous  work  in  2-D  Fourier  analysis  of  spectrograms 
[4][5][6][7].  Though  these  representations  have  been 
interpreted  implicitly  using  data-driven  techniques  [2]  or 
analytically  in  relation  to  modulation  theory  [8],  they  have 
nonetheless  been  difficult  to  interpret  from  a  parametric 
perspective  in  relation  to  speech-specific  characteristics  (e.g., 
pitch)  [9].  The  aim  of  this  work  and  our  previous  work  in  [10] 
is  to  provide  a  speech-based  interpretive  framework  for  the 
concept  of  “modulation”  in  2-D  processing  of  speech. 

In  [10],  we  developed  speech-specific  signal  models  for 
localized  2-D  Fourier  analysis  of  the  narrowband  spectrogram 
[7],  a  representation  referred  to  as  the  (narrowband)  Grating 
Compression  Transform  (NGCT).  More  generally,  it  is  of 
interest  to  apply  2-D  Fourier  analysis  to  time-frequency 
distributions  that  can  be  viewed  as  mixtures  of  both 
narrowband  and  wideband  spectrograms.  Examples  of  such 
mixed-resolution distributions include the auditory,
super-resolution, and cone-kernel spectrograms [11][12]. Towards
this  end,  we  develop  in  this  paper  signal  models  for  the 
counterpart  wideband  spectrogram  in  the  context  of  the  GCT 
(WGCT)  (Figure  1),  thereby  providing  a  more  complete 
interpretation  of  speech  signal  behavior  in  both  the  GCT 
framework  and  potentially  other  2-D  processing  schemes  and 
time-frequency  distributions. 
Figure 1. Schematic of general 2-D processing framework with short-time
analysis followed by localized 2-D analysis for narrowband (top) and
wideband (bottom) representations.


In  our  development,  we  show  that  the  WGCT  is  distinct 
from  the  NGCT  in  interpretation,  thereby  motivating  a  novel 
taxonomy  of  speech  signal  behavior  in  2-D  processing  of 
speech.  We  also  show  that  the  WGCT  can  be  used  in  speech 
signal  processing  via  sinusoidal-series-based  demodulation  as 
in  [10]  to  motivate  spectrogram  analysis/synthesis  methods. 
To  assess  the  ability  of  the  model  to  represent  speech  content, 
we  evaluate  these  methods  for  reconstruction  of  wideband 
spectrograms  and  as  an  example  application,  build  on  previous 
work  in  [10]  in  using  the  WGCT  for  co-channel  speaker 
separation  with  prior  pitch  information.  In  this  context,  we 
emphasize  our  focus  on  assessing  the  signal  models’ 
representations  of  speech  rather  than  developing  a  complete 
separation  system. 

This paper is organized as follows. Section II reviews the
GCT framework. Section III develops a 2-D speech signal
model for stationary voiced speech; Section IV describes
extensions to non-stationary voiced speech while Section V
discusses models for noise and onset/offset content. Section
VI presents a taxonomy of speech signal behavior in the
WGCT and NGCT. Section VII describes methods for
spectrogram reconstruction and speaker separation. Sections
VIII and IX present our results and conclusions, respectively.


observe  distinct  behaviors  in  each  region  and  their 
corresponding  WGCTs  based  on  the  proximity  to  the  first 
formant.  Subsequently,  we  argue  for  a  set  of  models  with 
general  form  of  (3)  to  characterize  these  behaviors. 


II.  Framework 

Here, we review the Grating Compression Transform
(GCT) framework. Consider the short-time Fourier transform
(STFT) of a speech signal $y[n]$ using a window $w[n]$

$$Y(n,\omega) = \sum_{m=-\infty}^{\infty} w[m-n]\,y[m]\,e^{-j\omega m}. \tag{1}$$

In [10], we considered $w[n]$ with length ($L$) 2-3 times the
pitch period $P$ of voiced speech present in $y[n]$, resulting in a
narrowband spectrogram. This window choice leads to
harmonic line structure oriented across frequency. For local
time-frequency regions of $|Y(n,\omega)|$,

$$|Y(n,\omega)|_{local} \approx w[n,\omega]\,H(n,\omega)\,E(n,\omega) \tag{2}$$


III.  Stationary  Voiced  Speech  Modeling 
A.  Single-Formant  Modeling 

Consider  a  simple  model  of  speech  in  which  an  impulse  train 

$$p[n] = \sum_{k=0}^{N_k} \delta[n-kP] \tag{4}$$

with periodicity $P$ and $N_k$ terms excites a single formant
modeled as a decaying sinusoid (and Fourier transform)

$$h[n] = \zeta_f\,e^{-\alpha_f n}\cos(\omega_f n)\,u[n], \tag{5}$$

H(")  +  . r"J(L,>  (6) 


where $w[n,\omega]$ is the 2-D window, $H(n,\omega)$ is the vocal tract
formant envelope, and $E(n,\omega)$ is a 2-D sinusoidal-series
carrier dependent on pitch and pitch dynamic content. In the
GCT domain, the model results in a distribution of the envelope
(Figure 1). Similar behavior was argued for unvoiced speech
and onset/offset content.


$\zeta_f$, $\alpha_f$, and $\omega_f$ are the amplitude, decay rate (corresponding to
formant bandwidth), and formant frequency, respectively. We
analyze the resulting signal

$$y[n] = \sum_{k=0}^{N_k} h[n-kP] \tag{7}$$
Figure 2. (a) Wideband spectrogram of real speech (male utterance
“needs”) illustrating analysis near the first formant ($\omega \approx 0.05$); (b) small
$\Delta$ (red), large $\Delta$ (green), and (c) an “edge” case (white); (d-f) WGCT
representation of three regions; note off-axis terms in (e); (d-e)
computed for regions including time slices in (b); see discussion of
simulations for WGCT computation details.


This paper considers $w[n]$ with $L < P$, such that $w[n]$
analyzes $y[n]$ within a single period $P$ of voiced speech [12].
This window choice leads to harmonic structure oriented
across time in a wideband spectrogram. A model for
wideband spectrograms of voiced speech is proposed in [12]


using  the  short-time  Fourier  transform  (STFT)  with  w[n]  of 
length  L  <  P  to  satisfy  the  wideband  constraint. 
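
As an illustration of the wideband constraint $L < P$, the following is a minimal sketch of a short-time analysis with a Hamming window shorter than the pitch period. The function name `wideband_stft`, the toy impulse-train input, and the parameter values are assumptions made for this example only. The frames are referenced to the start of each window rather than to absolute time as in (1), so phases differ from (1) by a linear term while magnitudes are unchanged.

```python
import numpy as np

def wideband_stft(y, L, nfft=2048, hop=1):
    """Short-time Fourier transform with a length-L Hamming window.

    Choosing L shorter than the pitch period P (in samples) yields a
    wideband analysis: each frame sees at most one pitch pulse.
    """
    w = np.hamming(L)
    n_frames = 1 + (len(y) - L) // hop
    frames = np.stack([y[i * hop:i * hop + L] * w for i in range(n_frames)])
    return np.fft.rfft(frames, n=nfft, axis=1)   # shape (n_frames, nfft // 2 + 1)

# Toy input (assumed): impulse train with P = 77 samples, analysed with L = 40 < P.
P, L = 77, 40
y = np.zeros(10 * P)
y[::P] = 1.0
S = np.abs(wideband_stft(y, L, nfft=256))
```

With this toy input, a frame that contains an impulse has a flat magnitude spectrum scaled by the window value at the impulse position, and a frame between impulses is identically zero, which is the harmonic structure "across time" described above.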

Consider the filterbank view of the STFT such that at an
analysis frequency $\omega = \omega_f + \Delta$ [12],

$$Y(n,\omega) = \left(y[n]\,e^{-j\omega n}\right) *_n w[n] \tag{8}$$

$$Y(n,\omega) = \left(\sum_{k=0}^{N_k} h[n-kP]\,e^{-j(\omega_f+\Delta)n}\right) *_n w[n] \tag{9}$$

By linearity of convolution, a single term in the summation is

$$Y(n,\omega;k) = \left(h[n-kP]\,e^{-j(\omega_f+\Delta)n}\right) *_n w[n]$$

with corresponding Fourier transform

°-5V  + _ 

af+e)(u>-V  a^ej(u'-2Uf-A)J 

$n$ maps to $\omega'$ through the Fourier transform and is distinct
from $\omega$. Since $W(\omega')$ is concentrated near $\omega' = 0$ and nearly
zero far away from the origin (i.e., at $\omega' = 2\omega_f + \Delta$),


(10) 


(11) 


$$|Y(n,\omega)| = E[n]\,\tilde{H}(\omega) \tag{3}$$

where $E[n]$ is a time-dependent “energy” term and $\tilde{H}(\omega)$
is a “smoothed” version of the true formant envelope. Figure
2 shows analysis of several local time-frequency regions of a
wideband spectrogram computed for voiced speech. We


K(o»',£j;fc)  » 

ejkP{o>f+E)e-j^kPW^  ^  _a) .  (12) 
Figure 3. (a) Fourier transform of impulse response (green) and window
(red); (b) small $\Delta$ case with majority of demodulated formant near origin
within window filter; modulated formant at $\omega' = 2\omega_f + \Delta$ excluded by
window filter; (c) large $\Delta$ case with tail of formant content within
bandwidth of window filter; $\omega' = 2\omega_f + \Delta$ component not shown.

We consider two limiting conditions of “small” or “large”
values of $\Delta$ and derive modulation representations in both
cases (Figure 3).

Small $\Delta$: Applying the inverse Fourier transform to (12),


$$Y(n,\omega;k) \approx \left(0.5\,\zeta_f\,e^{jkP(\omega_f+\Delta)}\right)\, w[n] *_n \left(e^{-\alpha_f(n-kP)}\,u[n-kP]\,e^{-j\Delta(n-kP)}\right). \tag{13}$$

For small $\Delta$, $e^{-jn\Delta}$ fluctuates slowly in time, and we therefore
approximate it as $e^{-jn\Delta} \approx \cos(\Delta) + j\sin(\Delta)$. Furthermore,
we assume that $\cos(\Delta)$ dominates $e^{-jn\Delta}$ for small $\Delta$ and
$j\sin(\Delta) \approx 0$. We then have

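$$Y(n,\omega;k) \approx \gamma(\omega)\,e^{jkP(\omega_f+\Delta)}\,e[n-kP] \tag{14}$$
$$\gamma(\omega) = 0.5\,\zeta_f\cos(\omega-\omega_f) \tag{15}$$
$$e[n-kP] = w[n] *_n \left(e^{-\alpha_f(n-kP)}\,u[n-kP]\right) \tag{16}$$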


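In (15), we have rewritten $\cos(\Delta)$ as $\cos(\omega - \omega_f)$ since
$\Delta = \omega - \omega_f$. Note that if $\Delta = 0$, (14) holds with equality with
$\gamma(\omega) = 0.5\,\zeta_f$. Returning to the summation over $k$, we obtain

$$Y(n,\omega) \approx \sum_{k=0}^{N_k} e^{jkP(\omega_f+\Delta)}\,\gamma(\omega)\,e[n-kP]. \tag{17}$$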


If  e[n  —  kP]  decays  to  zero  within  each  period,  the  magnitude 
of  the  sum  may  be  approximated  as  the  sum  of  the 
magnitudes,  i.e., 


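$$|Y(n,\omega)|_{local} \approx w[n,\omega]\,\gamma(\omega)\,E_d[n] \tag{18}$$

$$E_d[n] = \sum_{k=0}^{N_k} e[n-kP] = \sum_{l \ge 0} \beta_l \cos\!\left(\frac{2\pi l}{P}\,n + \psi_l\right) \tag{19}$$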


analysis window $w[n,\omega]$ to emphasize analysis in a local
time-frequency region of the wideband spectrogram.

Our derivation argues for a modulation model as in (3) with
a sinusoidal series carrier $E_d[n]$ representing source
periodicity and formant bandwidth/decay rate (Figure 2b) and
envelope $\gamma(\omega)$ representing frequency-dependent scaling of
the formant peak in the spectral domain. It can be shown from
an alternative Fourier transform view of the STFT that one
interpretation of this scaling is smoothing of the true formant
spectrum with the Fourier transform of the window,

$$\gamma(\omega) \approx \tilde{H}(\omega) = |W(\omega) *_\omega H(\omega)|. \tag{20}$$

We refer the reader to Appendix I for a discussion of this
derivation and subsequently illustrate its limitations through
simulations. Note that if the bandwidth of $W(\omega)$ is
substantially greater than that of $H(\omega)$, then the bandwidth of
$\tilde{H}(\omega)$ effectively becomes that of the window.

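Since $E_d[n]$ and $\gamma(\omega)$ are separable in (18), its 2-D Fourier
transform (i.e., the WGCT) is

$$Y(\nu,\Omega) = W(\nu,\Omega) *_{\nu,\Omega} \left[\eta(\Omega)\left(K\delta(\nu) + \sum_{l \ge 1} 0.5\,\beta_l\,e^{\mp j\psi_l}\,\delta\!\left(\nu \pm \frac{2\pi l}{P}\right)\right)\right] \tag{21}$$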

where $n$ and $\omega$ map to $\nu$ and $\Omega$, respectively, and $\eta(\Omega)$
($W(\nu,\Omega)$) is the Fourier transform of $\tilde{H}(\omega)$ ($w[n,\omega]$). $\eta(\Omega)$
is the WGCT representation of the smoothed formant
envelope in a local time-frequency region. Copies of $\eta(\Omega)$ are
weighted by $\beta_l$ coefficients (representing the bandwidth of the
formant) along the $\nu$-axis at multiples of $2\pi/P$ (representing the
source periodicity). This product is further smoothed in $\nu$ and
$\Omega$ with the Fourier transform of the 2-D analysis window.
Note that formant bandwidth along the $\omega$-axis “lost” due to
smoothing by the short-time analysis window is “recovered”
in time and represented in the carrier.
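
The relation between the decaying carrier and the source periodicity can be checked numerically. The sketch below builds $e[n]$ as a Hamming window convolved with a decaying exponential, sums shifted copies at the pitch period as in (19), and locates the strongest non-DC component of the resulting carrier. All parameter values and variable names are assumptions chosen for illustration.

```python
import numpy as np

# Assumed illustrative values: pitch period, window length, decay rate, periods.
P, L, alpha, n_periods = 77, 40, 0.01, 20
N = P * n_periods

w = np.hamming(L)
decay = np.exp(-alpha * np.arange(N))        # e^{-alpha n} u[n]
e = np.convolve(w, decay)[:N]                # e[n] = w[n] * (e^{-alpha n} u[n])

# E_d[n]: sum of copies of e[n] shifted by multiples of the pitch period.
E_d = np.zeros(N)
for k in range(n_periods):
    E_d[k * P:] += e[:N - k * P]

# Analyse a late segment spanning exactly 8 periods, where the sum has settled.
seg = E_d[-8 * P:]
spectrum = np.abs(np.fft.rfft(seg - seg.mean()))
peak_bin = int(np.argmax(spectrum))          # bin 8 of an 8-period segment is 2*pi/P
```

Because the segment spans eight pitch periods, a peak at bin 8 corresponds to the fundamental carrier frequency $2\pi/P$; higher harmonics appear at multiples of that bin with smaller weights, consistent with the role of the $\beta_l$ coefficients.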

The present and subsequent formulations motivate a
modulation/demodulation framework for speech signal
processing similar to [10]. Since copies of $\eta(\Omega)$ are
distributed in the WGCT space via the carriers, they may be
demodulated to reconstruct the $\eta(\Omega)$ term at the WGCT origin
if this component is corrupted, e.g., by an interfering signal.

Large $\Delta$: For large $\Delta$, the approximation in (14) does not
hold. $\omega = \omega_f + \Delta$ is “far away” from $\omega_f$, and we
alternatively assume that the frequency response of the
formant is approximately a single complex value $\gamma'(\omega)$ (i.e., a
flat spectrum) (Figure 3). The frequency domain
interpretation of this from (11) is

$$Y(\omega',\omega;k) = \gamma'(\omega)\,e^{-jkP(\omega'+\omega)}\,W(\omega'). \tag{22}$$


Inverting  (22)  and  invoking  the  summation  as  in  (17), 


where  y(tn)  is  assumed  to  be  non-negative  for  “small”- A  e.g., 
0  <  | A |  « |,  and  we  have  rewritten  Ed[n]  as  a  sinusoidal 
series  expansion.  Here,  we  have  also  introduced  a  2-D 


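$$Y(n,\omega) = \sum_{k=0}^{N_k} \gamma'(\omega)\,e^{-jkP\omega}\,\delta[n-kP] *_n w[n] = \sum_{k=0}^{N_k} \gamma'(\omega)\,e^{-jkP\omega}\,w[n-kP] \tag{23}$$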
Since $L < P$, the summed terms do not overlap in time. In a
local time-frequency region analyzed with a 2-D window
$w[n,\omega]$, the magnitude of the sum can be rewritten as a sum
of magnitudes, i.e.,

$$|Y(n,\omega)|_{local} = w[n,\omega]\,|\gamma'(\omega)|\,E_w[n] \tag{24}$$

$$E_w[n] = \sum_{k=0}^{N_k} w[n-kP] \tag{25}$$


+  Ew[n]R[co  -  u>0]). 

The  2-D  Fourier  transform  (WGCT)  of  (29)  is  (Figure  4) 


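$$Y(\nu,\Omega) = W(\nu,\Omega) *_{\nu,\Omega} \sum_{i \in \{d,w\}} \eta_{R,i}(\Omega)\left(K_{R,i}\,\delta(\nu) + \sum_{l \ge 1} 0.5\,\beta_{l,R,i}\,\delta\!\left(\nu \pm \frac{2\pi l}{P}\right)\right) \tag{30}$$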


resulting again in a modulation model of the spectrogram with
a source periodicity-dependent carrier $E_w[n]$ and an envelope
$|\gamma'(\omega)|$ which we again interpret as $\tilde{H}(\omega)$ from (20). An
analogous WGCT representation is (Figure 4)


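$$Y(\nu,\Omega) = W(\nu,\Omega) *_{\nu,\Omega} \left[\eta'(\Omega)\left(K\delta(\nu) + \sum_{l \ge 1} 0.5\,\beta'_l\,e^{\mp j\psi'_l}\,\delta\!\left(\nu \pm \frac{2\pi l}{P}\right)\right)\right] \tag{26}$$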


where $\eta'(\Omega)$ is the Fourier transform of $|\gamma'(\omega)|$ and $\beta'_l$ and $\psi'_l$
are parameters of the sinusoidal series representation of $E_w[n]$.
While the WGCT domain contains copies of $\eta'(\Omega)$ reflecting
smoothed formant structure in local time-frequency regions as
in the small $\Delta$ case, carrier positions and corresponding gain
terms reflect source periodicity only.
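
The sum-of-magnitudes step leading to (24) can be verified directly: when $L < P$ the shifted windows occupy disjoint time supports, so the phase factors $e^{-jkP\omega}$ have no effect on the magnitude. The sketch below is a numerical check under assumed values; the complex constant `gamma` stands in for a hypothetical flat response $\gamma'(\omega)$ at one analysis frequency.

```python
import numpy as np

# Assumed illustrative values.
P, L, Nk = 77, 40, 6
N = Nk * P
w = np.hamming(L)
omega = 0.5 * np.pi                   # analysis frequency "far" from the formant
gamma = 0.3 * np.exp(1j * 0.7)        # hypothetical flat complex response

E_w = np.zeros(N)                     # periodically summed windows
Y = np.zeros(N, dtype=complex)        # complex sum of phase-weighted windows
for k in range(Nk):
    E_w[k * P:k * P + L] += w
    Y[k * P:k * P + L] += gamma * np.exp(-1j * k * P * omega) * w

# Largest deviation between |Y| and |gamma| * E_w over all time samples.
max_err = np.max(np.abs(np.abs(Y) - np.abs(gamma) * E_w))
```

The deviation is at the level of floating-point rounding, which is the sense in which the magnitude of the sum equals the sum of magnitudes when the windows do not overlap.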


Figure 4. (a) Wideband spectrogram schematic illustrating analysis of a
single formant in distinct frequency regions: (1) large $\Delta$, (2) small $\Delta$, (3)
“in between” case; periodicity and bandwidth-dependent carrier (blue,
shaded), periodicity-dependent carrier (dotted lines) and composite
carrier; (b-d) WGCT of regions 1-3 with distinct modulated envelopes
delineated: small $\Delta$ - red, large $\Delta$ - yellow, “in between” - graded.


Composite Carrier: Our discussion thus far has described
modulation models in time-frequency regions of wideband
spectrograms for limiting cases of $\Delta$. To account for values of $\Delta$
“in between”, we propose a “composite” carrier

$$E_c[n,\omega] = E_d[n]\,R[\omega] + E_w[n]\,R[\omega-\omega_0] \tag{27}$$

$$R[\omega] = \begin{cases} 1, & 0 < \omega < M \\ 0, & \text{otherwise} \end{cases} \tag{28}$$


where $\eta_{R,i}(\Omega)$ is the Fourier transform of $\tilde{H}(\omega)R(\omega)$ ($i = d$)
and $\tilde{H}(\omega)R(\omega - \omega_0)$ ($i = w$). $K_{R,i}$ and $\beta_{l,R,i}$ are complex
coefficients corresponding to the sinusoidal series of the two
carrier types. The WGCT contains a scaled sum of $\eta_{R,i}(\Omega)$
terms at the origin and carrier locations. If the bandwidths $\nu_w$ of
the $\eta_{R,i}(\Omega)$ terms are such that $0.5\,\nu_w < \pi/P$, then their modulated
copies will occupy distinct regions along the $\nu$-axis (Figure 2).
Note that this model does not impose constraints on the
bandwidth along the $\Omega$-axis.

The WGCT also invokes a mapping of pitch $f_0$ information

$$\nu_0 = f_0\,\frac{2\pi}{f_s} \tag{31}$$

where $f_s$ is the sampling frequency of the waveform. If the
time width of the local time-frequency region is 2-3 times
the pitch period [12], the resulting WGCT exhibits distinct
copies of the envelope at multiples of $2\pi f_0 / f_s$; $f_0$ is inversely
related to the number of terms in the WGCT.
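
The pitch mapping in (31) is a unit conversion from Hz to radians per sample along the time-modulation axis. A minimal sketch follows; the function name `carrier_frequency` is hypothetical, and the values are the 16-kHz sampling rate and 77-sample pitch period used in the simulations of this section.

```python
import math

def carrier_frequency(f0_hz, fs_hz):
    """Map a pitch value in Hz to the WGCT carrier position in rad/sample."""
    return 2.0 * math.pi * f0_hz / fs_hz

fs = 16000.0
P = 77                        # pitch period in samples
f0 = fs / P                   # pitch in Hz implied by the period
nu0 = carrier_frequency(f0, fs)
```

By construction `nu0` equals $2\pi/P$, so carrier positions can be expressed either through the pitch in Hz or through the period in samples.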

B.  Multiple  Formants 

For multiple formants, we generalize (11) as the summation


na>',ay,k)  =  W(m') 
o.sf 

.;(»'- a)  + 


yNf  jkP(ti)'  +o>f+k)  I  af+e 

e  _ 2d 


■5f 


\af+e 


(32) 


where $N_f$ is the number of formants. Assuming that the $\omega_f$
are well-separated in frequency, we approximate $Y(\omega',\omega;k)$
as being dominated by a single formant in local frequency
regions. Consequently, identical arguments can be applied as
in the previous sections to arrive at modulation models for
individual formants. This invokes a sum-of-magnitudes
approximation for the magnitude

$$|Y(n,\omega)|_{local} \approx w[n,\omega] \sum_{f=1}^{N_f} E_c[n,\omega;f]\,\tilde{H}_f(\omega) \tag{33}$$


where $M$ ranges from 0 to the full length $M_{full}$ of the region in
frequency, and $\omega_0$ is a shift in frequency. A similar composite
carrier can be obtained by interchanging $E_w[n]$ and $E_d[n]$.
$E_c[n,\omega]$ may be modulated by $W(\omega)$ to invoke a modulation
interpretation as in the limiting cases. A generalized
modulation model in local time-frequency regions is


where $E_c[n,\omega;f]$ and $\tilde{H}_f(\omega)$ are formant specific. This
model interprets local regions of the wideband spectrogram
magnitude as a sum of modulation products. The WGCT
$Y(\nu,\Omega)$ is a summation of terms as in (33) for each formant

F(u,n)  =  W(u,n)  *„,n 


$$|Y(n,\omega)|_{local} = w[n,\omega]\,\tilde{H}(\omega)\left(E_d[n]\,R[\omega] + E_w[n]\,R[\omega-\omega_0]\right) \tag{29}$$
E(<r(d,iv)  r]R,l(ft'>D 


/  kruS(v)  +  \ 


$|Y'(n,\omega)|$ and replicate this across time to obtain a reference
estimate of the smoothed envelope term $H(n,\omega) \approx H_r(n,\omega)$.
Subsequently, we compute


where $\eta_{R,i}(\Omega;f)$, $K_{R,i,f}$, and $\beta_{l,R,i,f}$ are now formant-dependent
versions of those in (30). We expect this approximation to be
best for frequency regions near a formant peak, e.g., $\omega_f - \Delta <
\omega < \omega_f + \Delta$, analogous to the single-formant case.
Furthermore, at frequency regions far away from formant
frequencies, the summation implies dominance by a “large $\Delta$”
model (24) corresponding to a single formant. Nonetheless, if
formants interact within a local frequency region, the model
can be expected to be less accurate. In our subsequent analyses,
we show the effects of such interactions.

C.  Simulations 

Single Formant: Herein we illustrate properties of the
carrier models proposed for the previously described small
and large $\Delta$ conditions. We synthesize a decaying sinusoid
$h'[n]$ with $\omega_f = 0.1\pi$ corresponding to a periodicity of 20
samples, $\zeta_f = 1$, and $\alpha_f = 0.01$ (5); $h'[n]$ is excited with a
pure impulse train $p'[n]$ with periodicity $P = 77$ to generate
$y'[n]$. Signals are synthesized at 16 kHz with resulting pitch
(formant) frequency of 210 Hz (800 Hz). Wideband
spectrograms are computed using a Hamming window $w[n]$
with length $L = 40$ (2.5 ms); to account for an
extremal case of a 350-Hz pitch, $L$ can be chosen in general to
be less than $1/350$ s $\approx 2.9$ ms. A single-sample frame rate and a
2048-point discrete Fourier transform (DFT) are applied to both
$y'[n]$ and $p'[n]$ to obtain $|Y'(n,\omega)|$ and $|P'(n,\omega)|$. WGCT
analysis was performed using region sizes of 37.5 ms by 500
Hz extracted with a 2-D Hamming window followed by a 512
by 512-point 2-D DFT. Analogous to the choice of $L$, 2-3
times the period of the lowest pitch of 60 Hz constrains the time
width to ~33 to 50 ms. We refer the reader to our subsequent
discussion to motivate the choice of frequency widths.
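
The localized 2-D analysis step can be sketched as a separable 2-D Hamming window followed by a zero-padded 2-D DFT. The function name `wgct`, the toy carrier region, and the region dimensions below are assumptions for illustration and are not the exact configuration used for the reported results.

```python
import numpy as np

def wgct(region, nfft=512):
    """Localized 2-D Fourier analysis of a spectrogram region (time x frequency)."""
    T, F = region.shape
    if T > nfft or F > nfft:
        raise ValueError("region must fit inside the nfft x nfft DFT")
    win2d = np.outer(np.hamming(T), np.hamming(F))       # separable 2-D Hamming window
    spec = np.fft.fft2(region * win2d, s=(nfft, nfft))   # zero-padded 2-D DFT
    return np.fft.fftshift(spec)                         # origin moved to the centre

# Toy region (assumed): a carrier with period P = 77 samples in time,
# constant across 64 frequency bins, spanning six pitch periods.
P, T, F = 77, 6 * 77, 64
n = np.arange(T)
region = np.outer(1.0 + np.cos(2 * np.pi * n / P), np.ones(F))
G = np.abs(wgct(region))
c = G.shape[0] // 2                                      # index of the origin
```

For this purely time-dependent carrier, the energy lies on the line through the origin along the time-modulation axis, with a carrier peak near `512 / 77`, i.e. about 6.6 bins from the origin.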

For “small-$\Delta$”, we extract time slices from $|Y'(n,\omega)|$ at
$\omega = \omega_f$ and $\omega = \omega_f + \Delta$ with $\Delta = 0.0313\pi$ (corresponding
to 250 Hz). $\omega = \omega_f$ ($\Delta = 0$) represents the idealized carrier
in the modulation model as discussed in (14); $\Delta = 0.0313\pi$
represents a “small-$\Delta$” condition. Time slices are normalized
to have unity amplitude and shown in Figure 5b. We plot
absolute differences between the slices and compute the
root-mean-squared error (RMSE) across time. Consistent with the
model, both time slices resemble decaying exponentials
smoothed by the window with the $\Delta = 0.0313\pi$ case having
RMSE of ~0.09 relative to the $\Delta = 0$ case. This discrepancy
is presumably due to phase effects ignored in modeling.

For “large-$\Delta$”, Figure 5c shows a time slice extracted at
$\omega \approx 0.5\pi$ (i.e., “far away” from $\omega_f$). We also plot a time slice of
$|P'(n,\omega)|$ corresponding to periodically summed windows,
i.e., the idealized carrier $E_w[n]$. The $\omega = 0.5\pi$ time slice
closely matches $E_w[n]$ with RMSE of ~0.05.

In a second set of simulations, we explore properties of the
smoothed formant interpretation of the envelope term of the
modulation model (20). We replicate a time slice $|Y'(n,\omega =
\omega_f)|$ across all frequencies to generate a 2-D carrier $E'(n,\omega)$
and compute the time average of all spectral slices in


$$E'_e(n,\omega) = \frac{|Y'(n,\omega)|}{H_r(n,\omega)}, \qquad H_e(n,\omega) = \frac{|Y'(n,\omega)|}{E'(n,\omega)} \tag{34}$$


Figure 6 shows $E'_e(n,\omega)$ and time slices corresponding to the
decaying and window-based carriers in regions near and far
from $\omega_f$, respectively, as can be expected since $H_r(n,\omega)$
varies with frequency only. In addition, $H_e(n,\omega)$ is
reasonably matched to $H_r(n,\omega)$ in frequency regions near $\omega_f$,
though not for $\omega$ away from $\omega_f$. This is consistent with our
use of the exponential decaying carrier in computing
$H_e(n,\omega)$. In addition, $H_e(n,\omega)$ exhibits temporal fluctuations
in energy at $\omega_f$. This effect reflects the fact that the assumed
envelope $H_r(n,\omega)$ best matches in time regions away from
excitation impulses (see Appendix I). Quantitatively,
normalized spectral slices of $H_e(n,\omega)$ exhibit an RMSE
relative to $H_r(n,\omega)$ of up to ~0.09.
Figure 5. Wideband spectrogram of (a) decaying sinusoid excited with a
pure impulse train and (b) pure impulse train; note that a time slice of (b)
corresponds to periodically summed copies of the short-time analysis
window; (c) time slice of (a) located at the formant peak (red) and for a
small $\Delta$ value away from the peak (blue); absolute difference (green)
between the two curves; (d) as in (c) but for the idealized pure impulse
train time slice (red) and actual time slice located “far away” from the
formant peak (blue).
Figure 6. (a) $|Y'(n,\omega)|$; (b) $E'_e(n,\omega)$; (c) time slices of (b); (d) $H_r(n,\omega)$;
(e) $H_e(n,\omega)$; (f) spectral slices of (d) and (e); RMSEs in (f) computed
between normalized spectral slices of (e) and the reference estimate in (d).


In  a  final  set  of  simulations,  we  assess  model  properties  in 
frequency  regions  in  between  the  limiting  cases  of  “small” 
and “large” $\Delta$. Figure 6 shows a local region of $E'_e(n,\omega)$ (34)
centered at $\omega = 0.23\pi$ in which two carriers appear to interact
within the same local region. The corresponding WGCT
contains components off the horizontal axis, violating the
assumption of a strictly time-dependent carrier ($E_d[n]$, $E_w[n]$).
From (27), we set each half of the region in frequency to
$E_d[n]$ and $E_w[n]$. Observe that the resulting WGCT of this
signal does indeed exhibit off-axis terms similar to those in Figure
7a. In Figure 7c, we show the result of summing $E_w[n]$ and
$E_d[n]$ without applying $R[\omega]$; the resulting WGCT does not
exhibit off-axis terms, indicating that the displacement effects
of $R[\omega]$ corresponding to phase terms in the WGCT are
crucial in modeling this behavior. This can be understood from
(30) by noting that the Fourier transform of $E_c[n,\omega]$ has the
same form but with $\eta_{R,i}(\Omega)$ replaced by the Fourier transforms
of $R[\omega]$ and $R[\omega - \omega_0]$, thereby invoking dependence along
$\Omega$ in the WGCT domain.
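
The role of $R[\omega]$ in producing off-axis terms can be reproduced with a small numerical experiment. The sketch below assigns the decaying carrier to one half of a region in frequency and the window-based carrier to the other half, and compares the result with a direct sum replicated across frequency. The parameter values, the helper `off_axis_fraction`, and the choice $\omega_0 = M$ with a region of width $2M$ are assumptions for illustration, and no 2-D analysis window is applied.

```python
import numpy as np

P, L, alpha, Nk, F = 77, 40, 0.01, 12, 64   # assumed illustrative values
N = Nk * P
w = np.hamming(L)

E_w = np.zeros(N)
for k in range(Nk):
    E_w[k * P:k * P + L] += w                # periodically summed windows

e = np.convolve(w, np.exp(-alpha * np.arange(N)))[:N]
E_d = np.zeros(N)
for k in range(Nk):
    E_d[k * P:] += e[:N - k * P]             # decaying carrier

seg = slice(N - 6 * P, N)                    # local region: last six periods
E_w = E_w[seg] / E_w[seg].max()
E_d = E_d[seg] / E_d[seg].max()

R = (np.arange(F) < F // 2).astype(float)    # R[w] with M equal to half the region
composite = np.outer(E_d, R) + np.outer(E_w, 1.0 - R)   # each carrier in one half
direct = np.outer(E_d + E_w, np.ones(F))                # summation without R[w]

def off_axis_fraction(x):
    """Fraction of 2-D spectral energy with both modulation frequencies nonzero."""
    X = np.abs(np.fft.fft2(x)) ** 2
    return X[1:, 1:].sum() / X.sum()

frac_composite = off_axis_fraction(composite)
frac_direct = off_axis_fraction(direct)
```

The direct sum is constant across frequency, so its 2-D spectrum is confined to the axis, whereas the composite carrier places a measurable fraction of its energy off the axis because the two halves carry different time structure.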
Figure 8. (a) Spectrogram of vowel with local region highlighted (white);
(b) high-pass filtered version of (a) for use in reconstruction; (c) original
local region; (d) estimate of (c) using demodulation.

Figure 9. (a) RMSE as a function of frequency widths and frequency
points analyzed; (b) RMSE for frequency center points corresponding to
formant frequencies.


Figure 7. (a) Local region from $E'_e(n,\omega)$ centered at ~$0.23\pi$; (b)
“composite” carrier; (c) carrier obtained from direct summation; (d-f)
WGCT of (a-c), respectively, with $\Omega = 0$ (line) denoted.

Multiple  Formants:  Herein  we  explore  properties  of  the 
sum  of  modulation  products  model  (33)  for  multiple  formants. 
In  addition,  we  implicitly  investigate  effects  of  downsampling 
the  spectrogram,  as  is  typically  done  in  implementation,  on  the 
model.  Furthermore,  we  motivate  a  choice  of  region  size  in 
WGCT  analysis  along  the  frequency  dimension. 

A synthetic vowel generated using a pure impulse train $p[n]$
with a 250-Hz pitch is filtered with a stationary formant
structure with frequencies (bandwidths) 669, 2349, 2972, 3500
Hz (65, 90, 156, 200 Hz) to generate $y[n]$ (i.e., a female /ae/
vowel [13]). Spectrograms are computed as in the previous
section, though with a frame rate of 10 samples. In addition,
we apply a high-pass filter to the spectrogram and aim to
recover localized regions using demodulation with
bootstrapping as alluded to in Section III.A. For each point
along the $\omega$-axis, we extract a region of the filtered
spectrogram of time length 37.5 ms and vary the frequency
width to obtain local regions (Figure 8d). Using
demodulation, we obtain an estimate of the original local
region; we refer the reader to Section VII.A for details of the method
and focus here on the results. We compute the
root-mean-squared error (RMSE) between the estimate and the original 2-D
region after both are scaled to have a maximum value
of unity for comparison purposes.
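
The error measure described above can be written compactly as follows. This is a minimal sketch; the function name `normalized_rmse` is hypothetical, and the only behaviour assumed is the scaling of both regions to a maximum of unity before the RMSE is taken.

```python
import numpy as np

def normalized_rmse(estimate, reference):
    """RMSE between two regions after each is scaled to a maximum value of one."""
    est = np.abs(estimate) / np.max(np.abs(estimate))
    ref = np.abs(reference) / np.max(np.abs(reference))
    return float(np.sqrt(np.mean((est - ref) ** 2)))

# Usage: a uniformly rescaled copy of a region gives zero error.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
err_scaled = normalized_rmse(5.0 * a, a)
```

The normalization makes the measure insensitive to an overall gain difference between the reconstruction and the original region, so it reflects differences in shape only.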


Figure 9a shows results across all frequency center points
and widths (df). Figure 9b shows results of analysis at select
center frequencies. Despite the presence of multiple formants,
RMSEs for reconstructions centered at the formant
frequencies do not exceed ~0.15 for frequency widths ranging
from zero to $0.1\pi$ (corresponding to ~800 Hz) and result in
reasonable estimates of the original region. RMSE values
generally increase up to a local maximum for larger widths
followed by a modest decrease. At frequency regions “far
away” from formant peaks, e.g., at $\omega = 0.8\pi$, reconstructions
also follow this trend, though with substantially less growth in
RMSE beyond frequency widths of $0.1\pi$; this is due to the
absence of interacting formant structure in these frequency
regions. Conversely, at $\omega = 0.2\pi$, the slope of the RMSE is
sharper than for the individual formant and $\omega = 0.8\pi$ cases, reflecting
effects of formant interactions (here, F1 and F2).

IV.  Extensions  to  Non-stationary  Voiced  Speech 

A.  Dynamic  Formants 

Modeling:  As  discussed  in  Appendix  I,  a  Fourier  transform 
view  of  the  wideband  spectrograms  argues  for  a  similar 
modulation  model  to  (29)  that  includes  formant  dynamics.  In 
the  time-frequency  space,  we  view  dynamic  formant  content 
as  a  rotated  rectangle  in  the  time-frequency  space  such  that 
the  2-D  Fourier  transform  is  the  rotation  of  the  2-D  Fourier 
transform  of  a  rectangle  from  image  processing  principles 
(Figure  18)  [10].  While  the  derived  model  is  posed  under 
relatively  restrictive  conditions  in  relation  to  time  segments 
away  from  excitation  impulse  onsets,  herein  we  illustrate  with 
a  simple  example  that  the  model  can  nonetheless  provide  a 
reasonable  interpretation  of  dynamic  formants. 
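
The rotation argument above relies on the standard rotation property of the 2-D Fourier transform, restated here for a continuous-domain function as a reminder. The symbols $s$, $S$, and $R_\theta$ are introduced only for this illustration and are not part of the model notation.

```latex
% If s(x) has 2-D Fourier transform S(u), rotating s by an angle theta
% rotates its transform by the same angle.
\[
  s_\theta(\mathbf{x}) = s\!\left(R_\theta^{-1}\mathbf{x}\right)
  \;\Longleftrightarrow\;
  S_\theta(\mathbf{u}) = S\!\left(R_\theta^{-1}\mathbf{u}\right),
  \qquad
  R_\theta =
  \begin{pmatrix}
    \cos\theta & -\sin\theta \\
    \sin\theta & \phantom{-}\cos\theta
  \end{pmatrix}.
\]
```

Under this property, an envelope whose orientation in the time-frequency plane changes with formant movement is expected to appear with a correspondingly rotated orientation near the origin and near each carrier position in the WGCT.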


frequency  regions  have  time  widths  such  that  pitch  values  are 
approximately  constant.  Subsequently,  we  quantitatively 
assess  the  effect  this  has  on  a  range  of  pitch  variations. 
Figure  10.  (a)  Wideband  spectrogram  of  diphthong  with  local  region 
(white);  (b)  local  region  of  (a);  (c)  GCT  of  (b)  with  rotated  (white  line) 
envelope  structure  near  origin;  arrows  denote  demodulation  of  carrier 
terms  down  to  DC;  (d)  WGCT  of  demodulated  version  of  (c)  with 
comparable  rotated  components  to  match  that  in  (c).  In  (c)  and  (d),  DC 
value  is  removed  for  illustrative  purposes;  in  (d),  display  limited  to  near- 
DC  region  due  to  presence  of  cross  terms  in  demodulation. 
Dynamic Formant Model Simulations: We synthesize a
200-ms diphthong with start-to-end formant frequencies
(bandwidths) of 669, 2349, 2972, 4000 Hz (65, 90, 156, 200
Hz) to 437, 2761, 3372, 4000 Hz (38, 66, 171, 200 Hz). The
source  signal  is  a  pure  impulse  train  with  200-Hz  pitch.  The 
wideband  spectrogram  and  WGCTs  are  computed  as  in  the 
previous  section. 

Figure 10a-b shows a local region near the increasing
second formant. Figure 10c shows the corresponding WGCT
near the first carrier position; for display purposes, DC values
at both the origin and carriers have been removed. At these
locations, we observe rotated components corresponding to
the local envelope structure present in Figure 10b; the rotation
of these components can be quantified by measuring the angle
of the near-DC peaks relative to the $\Omega$-axis of -0.24 radians.

As noted in Section III.A, demodulation of envelope
content from carrier positions may be used to recover near-DC
terms in the WGCT. Figure 10d shows an example of
demodulating the carrier components in Figure 10c to DC.
Since in reconstruction we further remove any resulting cross
terms by low-pass filtering (see Section VII), we restrict our
display to the near-DC regions here. A set of rotated
components is obtained at DC with angle -0.23 radians,
matching those at DC in Figure 10c. These results are consistent
with a generalized 2-D envelope H(n, ω) as argued in
Appendix I in relation to the modulation model.

B.  Time-varying  Pitch 

Model: Time-varying models of pitch have been explored
by a number of researchers such as in [14]. In the short-time
spectrum, the behavior of a time-varying impulse train has been
described qualitatively as “blurring” (i.e., widening) of
harmonic peaks near the “average” pitch; this effect may be
interpreted as multiple peaks in the spectrum corresponding to
a Bessel function expansion [12]. In our present
development,  we  impose  the  constraint  that  local  time- 


Figure  11.  (a)  Wideband  spectrogram  of  changing  pitch  with  time 
segment  (37.5  ms)  denoted  (white);  (b)  WGCT  of  full  time  slice  (maroon) 
and  time  segment  of  (a)  (black);  peaks  obtained  in  direct  mapping  (blue) 
and  bootstrapping  (green);  (c)  RMSE  of  reconstructions  using  direct 
versus  bootstrapping  methods;  (d)  reconstruction  of  1  Hz/ms  case  with 
truth  (red),  direct  (blue),  and  bootstrapping  (green)  denoted. 


Time-varying Pitch Simulations: We synthesize impulse
trains of duration 200 ms with linearly increasing pitch (varied
from 0 to 1 Hz/ms) with a starting pitch value of 175 Hz. In
analysis, we compute the wideband spectrogram and attempt
to resynthesize a full time slice from time segments of size
37.5 ms using 1-D WGCT analysis (Figure 11). We extract
“peak” locations in the WGCT to resynthesize a sinusoidal
series. Peak locations are determined using either 1) the direct
pitch information mapping of (31) or 2) bootstrapping of the
peak locations (Figure 11b). In the former method, the pitch
value defined at the center of the time segment is used; in the
latter, the mapped locations are reassigned using a 1-D
multi-peak picker applied to the WGCT (see Section V and [10]).
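As a rough illustration of the synthesis step described above, the following minimal sketch generates an impulse train with linearly increasing pitch using a phase accumulator. The function name, the 16-kHz sampling rate, and the accumulator approach are assumptions made for illustration; the text does not specify how the impulse trains were generated.

```python
def impulse_train_linear_pitch(f_start_hz, rate_hz_per_ms, dur_ms, fs_hz):
    """Impulse train whose pitch rises linearly from f_start_hz.

    A phase accumulator (in cycles) places one unit impulse each time the
    accumulated phase reaches a whole cycle.
    """
    n_samples = int(round(dur_ms * 1e-3 * fs_hz))
    x = [0.0] * n_samples
    phase = 1.0  # start at a full cycle so the first sample carries an impulse
    for n in range(n_samples):
        if phase >= 1.0:
            x[n] = 1.0
            phase -= 1.0
        # instantaneous pitch in Hz at sample n (rate converted to Hz/s)
        f_inst = f_start_hz + rate_hz_per_ms * 1e3 * (n / fs_hz)
        phase += f_inst / fs_hz
    return x


FS_HZ = 16000  # assumed sampling rate
steady = impulse_train_linear_pitch(175.0, 0.0, 200.0, FS_HZ)  # 0 Hz/ms
rising = impulse_train_linear_pitch(175.0, 1.0, 200.0, FS_HZ)  # 1 Hz/ms
steady_count = int(sum(steady))
rising_onsets = [n for n, v in enumerate(rising) if v == 1.0]
```

In the rising case the spacing between successive impulses shrinks over the 200-ms duration, which is the aperiodicity that a single pitch value per time segment cannot capture.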

The resulting WGCT shows that the direct mapping can
result in “peak” locations that appear harmonically related but
deviate from the actual WGCT peaks (Figure 11b). As an
extremal example of the variation in peak location with
time-varying pitch, Figure 11b shows a GCT computed for the full
time slice; we observe two peaks with substantially widened
bandwidths consistent with the previously described
Bessel-like behavior. We compute the root-mean-squared error
(RMSE) between normalized estimates and true time slices.
Figure 11c shows that RMSE increases dramatically using the
direct method for rates > ~0.1 Hz/ms in contrast to the
bootstrapping technique. At a rate of 1 Hz/ms, bootstrapping
(RMSE = ~0.16) maintains the aperiodicity of the signal while
the direct mapping (RMSE = ~0.54) deviates substantially
(Figure 11d). This motivates a bootstrapping approach to obtain
carrier locations that may not correspond exactly to the pitch
mapping of (31).


V.  Noise  and  Onsets/Offsets  Models 
A.  Noise 

Model: We now consider modeling of noisy signals (e.g.,
fricatives) in the WGCT. The analytical form of the WGCT
model of noise is identical to that presented for the
narrowband GCT (NGCT), and we refer the reader to [10] for
more details while focusing on the empirical behavior of noise in
the WGCT in this section.




Figure  12.  (a)  Wideband  spectrogram  of  white  noise  (log-scale);  (b) 
WGCT  of  a  single  region;  (c)  ideal  average  power  spectrum;  (d) 
estimated average power spectrum.

A zero-mean independent and identically distributed (i.i.d.)
Gaussian process w[t] with standard deviation σ can be
analyzed with a wideband short-time Fourier transform
magnitude w[n, ω]. We model w[n, ω] as arising from a 2-D
random process under assumptions of i.i.d. time-frequency
units with Rayleigh distribution. The (average) 2-D power
spectrum of w[n, ω] is then [10]


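$$S_{ww,\mathrm{gct}}(\nu,\Omega) = \left[\frac{4-\pi}{2}\sigma^2 + \frac{\pi}{2}\sigma^2\,\delta(\nu,\Omega)\right] *_{\nu,\Omega} |W(\nu,\Omega)|^2 = \frac{\pi}{2}\sigma^2\,|W(\nu,\Omega)|^2 + \frac{4-\pi}{2}\sigma^2\,p \tag{35}$$

$$p = \iint |W(\nu,\Omega)|^2\,d\nu\,d\Omega \tag{36}$$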

where W(ν, Ω) is the 2-D Fourier transform of the 2-D
window used to extract localized regions of w[n, ω].
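The two weights in a power spectrum of this form follow from the first two moments of a Rayleigh variable. The sketch below checks those textbook moments against a small simulation; it assumes the Rayleigh scale equals σ (in an actual STFT the scale also depends on the analysis-window energy), and the function names are illustrative only.

```python
import math
import random


def rayleigh_moments(sigma):
    """Closed-form mean and variance of a Rayleigh variable with scale sigma."""
    mean = sigma * math.sqrt(math.pi / 2.0)
    var = (4.0 - math.pi) / 2.0 * sigma ** 2
    return mean, var


def simulate_magnitudes(sigma, count, seed=0):
    """Magnitudes of complex Gaussian samples (real/imag parts ~ N(0, sigma^2))."""
    rng = random.Random(seed)
    return [math.hypot(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for _ in range(count)]


sigma = 1.0
mean_theory, var_theory = rayleigh_moments(sigma)
mags = simulate_magnitudes(sigma, 20000)
mean_emp = sum(mags) / len(mags)
var_emp = sum((m - mean_emp) ** 2 for m in mags) / len(mags)

# Under the i.i.d. assumption, the squared mean weights the window term
# |W|^2 (the near-DC component) and the variance sets the flat floor.
dc_weight = mean_theory ** 2   # (pi / 2) * sigma^2
floor_weight = var_theory      # ((4 - pi) / 2) * sigma^2
```

Because the squared mean exceeds the variance (about 1.57 versus 0.43 for σ = 1), the near-DC term would be expected to dominate, in line with the dominance of the near-DC region noted in the simulations.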

To obtain an instantaneous model, we invoked in [10] the
Karhunen-Loeve expansion under the assumption of distinct
frequency bands of the filterbank view of the 2-D Fourier
transform [12]. Specifically, a sum of arbitrary sinusoids on a
DC pedestal was viewed as the carrier component in the
modulation model of (29) (and corresponding WGCT)

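$$|Y(n,\omega)| = w[n,\omega]\,H[n,\omega]\,E[n,\omega] \tag{37}$$

$$E[n,\omega] = K + \sum_{k=1}^{N_c} a_k \cos(\phi_k[n,\omega]) \tag{38}$$

$$\phi_k[n,\omega] = \Omega_k\,(n\cos\theta + \omega\sin\theta) + \varphi_k \tag{39}$$

$$Y(\nu,\Omega) = W(\nu,\Omega) *_{\nu,\Omega} \left( K\,H(\nu,\Omega) + \sum_{k=1}^{N_c} 0.5\,a_k\,H(\nu \pm \Omega_k\cos\theta,\ \Omega \pm \Omega_k\sin\theta) \right) \tag{40}$$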

with N_c as the number of carriers, Ω_k the spatial frequency
of the k-th 2-D sinusoid, θ its orientation, and φ_k its phase term.
Here, we have allowed for a 2-D envelope
H[n, ω] as in the time-varying formant condition. As in the
voiced case, this model argues for a distribution of envelope
content in the WGCT space at carrier locations (Figure 15f).
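The modulation view above can be checked numerically on a toy region: an envelope multiplied by a DC pedestal plus one 2-D sinusoid should produce a copy of the envelope spectrum at the carrier position. The sketch below is a minimal illustration with assumed values (a 16 x 16 grid, one carrier, a periodic Hamming shape standing in for the windowed envelope) and a naive 2-D DFT; none of these choices come from the text.

```python
import cmath
import math

N = 16            # grid size along time (n) and frequency (w); assumed
K, A = 1.0, 1.0   # DC pedestal and carrier amplitude; assumed
KN, KW = 3, 2     # carrier cycles across the grid along n and w; assumed


def envelope(n, w):
    """Smooth separable 2-D envelope (periodic Hamming shape)."""
    hn = 0.54 - 0.46 * math.cos(2 * math.pi * n / N)
    hw = 0.54 - 0.46 * math.cos(2 * math.pi * w / N)
    return hn * hw


def region(n, w):
    """Envelope times (pedestal + one 2-D sinusoidal carrier)."""
    carrier = K + A * math.cos(2 * math.pi * (KN * n + KW * w) / N)
    return envelope(n, w) * carrier


def dft2_magnitude(f):
    """Naive 2-D DFT magnitude of f(n, w) on an N x N grid."""
    out = [[0.0] * N for _ in range(N)]
    for kn in range(N):
        for kw in range(N):
            acc = 0j
            for n in range(N):
                for w in range(N):
                    acc += f(n, w) * cmath.exp(-2j * math.pi * (kn * n + kw * w) / N)
            out[kn][kw] = abs(acc)
    return out


def wrapped(k):
    """Signed bin index in [-N/2, N/2)."""
    return k - N if k >= N // 2 else k


mag = dft2_magnitude(region)
# Skip the near-DC block (the envelope copy at the origin) and find the
# strongest remaining bin; it is expected at the carrier position.
candidates = [(mag[kn][kw], kn, kw) for kn in range(N) for kw in range(N)
              if max(abs(wrapped(kn)), abs(wrapped(kw))) > 1]
peak_mag, peak_kn, peak_kw = max(candidates)

# Equivalent carrier parameters in the notation of the model.
omega_k = 2 * math.pi * math.hypot(KN, KW) / N   # spatial frequency
theta = math.atan2(KW, KN)                       # orientation
```

The off-origin peak lands at bin (3, 2) or its conjugate (13, 14), with half the carrier amplitude times the envelope's DC value, which matches the 0.5 a_k scaling of the shifted envelope copies.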



Figure  13.  (a)  Original  spectrogram  of  noise-excited  vowel;  (b)  low-pass 
filtered  version  of  (a)  resulting  in  envelope  term;  (c)  reconstruction  after 
high-pass  filtering  and  demodulation;  RMSE  computed  between  (c)  and 
(a);  (d)  low-pass  filtered  version  of  (c)  indicating  recovery  of  low-pass 
envelope  term  in  (b);  RMSE  computed  between  (d)  and  (b). 

Simulations: Herein we compute the empirical 2-D power
spectra of white noise in wideband spectrograms for
comparison to the proposed model. Figure 12a shows a
wideband spectrogram computed for w[t] with σ = 1.
WGCT analysis was performed using the parameters
described previously for vowels. Figure 12d illustrates the
power spectrum obtained from averaging all regions analyzed.
While the model captures the dominance of the near-DC
region of the WGCT, it fails to capture substantial 2-D
spectral shaping effects. Figure 12b shows WGCT analysis
results for a single region, consistent with the averaged
spectrum in Figure 12d. The estimated spectrum has a
substantial component along the ν-axis due to correlation
across the frequency axis (ω) in the wideband spectrogram
such that we observe vertical striations (Figure 12a) across
time. Specifically, the short-time spectrum is substantially
smeared across ω due to the relatively short length of the
window (and therefore wide bandwidth in the spectrum). This
behavior is the “dual” of the narrowband GCT that exhibited
components along the Ω-axis due to temporal correlation
effects of processing noise.

In a second set of simulations, we aim to assess the extent to
which (37) can represent noisy speech. We compute the
wideband spectrogram of a vowel with formant structure as in
the previous sections but excited with Gaussian white noise.
Next, we adopt the framework of the simulations for
multiple formants in removing DC components in the WGCT
with the aim of approximately reconstructing them through
demodulation (Section III.A, Section V). Figure 13 shows
reconstruction results and low-pass filtering of the original and
reconstruction. Observe that the reconstruction results in
recovery of the low-pass envelope to match that of the original
spectrogram; this is consistent with the demodulation process
recovering the near-DC terms of the WGCT from its
distributed copies due to the carrier.

B.  Onsets/Offsets 

Model: Similar to the noise case, herein we briefly describe
onset/offset content observed in wideband spectrograms, which is
similar to that observed for the narrowband case [10]. An
isolated impulse i[n] = δ[n - N_0] located at N_0 can be
modeled as a downsampled short-time analysis window w_n[n]
in the spectrogram domain (denoted as I[n, ω])

$$I[n,\omega] = w_n[N_0 - nN] \tag{41}$$

where N is the frame rate of the STFT. The GCT is

$$I(\nu,\Omega) = W(\nu,\Omega) *_{\nu,\Omega} W_n\!\left(\frac{\nu}{N}\right) e^{j\nu N_0} \tag{42}$$

where *_{ν,Ω} denotes convolution in the GCT domain and
W(ν, Ω) is the 2-D Fourier transform of a 2-D window
w[n, ω] used to extract a localized time-frequency region.
We view I(ν, Ω) as an envelope term in the modulation model
of (29) in the context of a carrier due to voiced (e.g., (27)) or
noisy speech. As with formant envelopes, we impose a
bandlimited constraint on I(ν, Ω) in the context of modulation
(30). In the wideband case, I(ν, Ω) will have larger bandwidth
than in the narrowband case due to the sharpness of the
representation in time. Specifically, wideband parameters are
a 2.5-ms (L = 40) short-time analysis window and frame rate
of 0.625 ms. This results in a Hamming window mainlobe
W_n(ν/N) width of (8π/40)·4 = 0.8π [10]; in contrast, a 32-ms
window and 1-ms frame rate in the narrowband case result in a
mainlobe width of (8π/512)·25 = 0.3906π.


Figure 14. (a) Spectrogram of voicing and noise onset; (b) reconstruction
of (a); (c) low-pass filtered version of (a) demonstrating onset/offset
envelopes; (d) as in (c) but for the reconstruction in (b); associated RMSEs
computed after normalization in all cases; log spectrograms plotted to
emphasize widening effects.

Simulations: Figure 14 shows results of synthesizing and
reconstructing voicing and noise onsets/offsets using
demodulation (as done in the noise case). The reconstruction
in Figure 14b exhibits widening of the onsets, as may be
expected from the bandlimited nature of the analysis/synthesis
method. Nonetheless, this widening is consistent with the
envelope obtained in low-pass filtering the original signal in
Figure 14c, as can be shown in filtering the reconstruction
in Figure 14d.

VI.  A  Taxonomy  of  Speech  Signal  Behavior  in  the  GCT 

Our  discussions  motivated  a  modulation  view  of  the 
wideband  spectrogram.  Specifically,  in  voiced  regions,  the 
wideband  spectrogram  can  be  viewed  as  summation  of 
modulation  components,  where  each  component  corresponds 
to a formant. A carrier E_c[n, ω] is dependent on source
periodicity  and  (under  certain  conditions)  formant  bandwidth 


and is modulated by a smoothed (single) formant or envelope
|S_i(n_0, ω)|. Noise and onsets/offsets are viewed in this
framework as carrier and envelope components, respectively.


Figure 15. Narrow (top) and wideband (bottom) representations of: (a, b)
stationary  formant  and  pitch,  (c,  d)  stationary  pitch  and  dynamic 
formant,  and  (e,  f)  noise  content. 


Figure  16.  Narrow  (top)  and  wideband  (bottom)  representations  of:  (a,  b) 
dynamic  pitch  and  stationary  formant,  (c,  d)  dynamic  pitch  and  dynamic 
formant,  and  (e,  f)  onset/offset  content. 

This signal model has some similarity to that proposed for
narrowband spectrograms [10], and in subsequent sections, we
assess its ability to represent speech content using algorithms
similar to those used in [10]. Nonetheless, important
distinctions exist in the form and interpretation of the two
models, and in Figures 15 and 16 we compare the mapping
of changing/stationary pitch/formant, noise, and
onset/offset content for both representations.

For voiced speech, stationary pitch mappings in the NGCT
and WGCT are “duals” of each other along the Ω (narrow) and
ν (wide) axes, as schematized in Figure 15a-b. This mapping
distinction is preserved even when formant dynamics are
introduced (Figure 15c-d). In contrast, pitch dynamics
invoke a rotation of components in the NGCT while invoking
widening of the formant content along the ν-axis in the
WGCT due to the presence of widened harmonic content of
the carrier (Section II.C), as schematized in Figure 16a-d. An
additional narrowband/wideband “duality” is observed in
mapping noise to the GCT domain with components along the
Ω (narrow) and ν (wide) axes (i.e., ν = 0 and Ω = 0),
respectively (Figure 15e-f). Finally, the WGCT exhibits
greater bandwidth of onset/offset content relative to the NGCT
due to differences in short-time analysis resolution (Figure
16e-f).

Table 1 presents a taxonomy of speech signal behavior as
represented in the narrowband/wideband models. We denote
H(n, ω) as the formant structure, g(·) as a general function,
and ω_c as the center frequency of the local region analyzed.
Several distinctions include a summation of modulation
products (WGCT) vs. a single modulation product (NGCT) and
single (NGCT) vs. multiple (WGCT) carrier types; in addition,
carriers have distinct dependencies on source periodicity f_0
(NGCT, WGCT), pitch dynamics df_0/dt (NGCT), formant
bandwidth α_f (WGCT), and ω_c. “Dual” behavior exists in pitch
mappings between the two GCTs; specifically, high pitch values
result in low (i.e., near GCT origin) frequency components in the
NGCT and high frequency components in the WGCT. This
effect also results in the difference in the number N_p of harmonic
terms in the GCT as they relate to pitch. While noise is
viewed as a carrier term in modulation in both representations,
its localization is distinct between the two as previously noted.
Finally, onsets/offsets are interpreted as envelope terms in
both cases though with differences in bandwidth ν_a along the
ν-axis.


Table 1. Comparison of signal model interpretations for narrow- and
wideband-based Grating Compression Transforms.

Interpretation       Narrowband GCT                 Wideband GCT
Local model          Y(n, ω) = H(n, ω) E(n, ω)      Y(n, ω) = ...
Envelope (vowels)    ≈ |H(ω, n) * W(ω)|             ...
Carrier (vowels)     E(n, ω; f_0, df_0/dt, ω_c)     E_c(n, α_f, ω_c) = g(E_d(n), E_w(n), R(ω))
f_0 mapping          ν_0 ∝ 1/f_0, ω_c               ν_0 ∝ f_0
f_0 and N_p          N_p ∝ f_0                      N_p ∝ 1/f_0
Noise                Along ν = 0; carrier           Along Ω = 0; carrier
Onsets/Offsets       ν_a = 0.39π                    ν_a = 0.8π
VII. Spectrogram Analysis/Synthesis and Co-channel
Speaker  Separation 

Herein we describe approaches to test the proposed model’s
ability to represent speech content through spectrogram
analysis/synthesis and co-channel speaker separation. As
these methods are algorithmically much the same as those
in [10] and [5], we refer the reader to those works for details
and focus here on the general framework and distinctions.


Figure 17. (a) Local time-frequency region with carrier (orange) and
envelope (shaded) components; (b) corresponding WGCT with candidate
peaks from peak-picking (‘+’) and reassignment of directly mapped
carrier locations to candidate peaks; ‘x’ denotes removal of near-DC
term; (c) demodulation of components located at carrier locations
obtained from direct mapping for reconstruction; (d) as in (c) but using
reassigned carrier locations (bootstrapping).

A.  Analysis/Synthesis 

In the proposed signal model, the WGCT domain consists of
envelope content near the origin and at carrier locations
(Figure 17) due to sinusoidal-series-based modulation. As a
framework for reconstruction, we aim to approximately
recover the near-DC terms in the GCT from their modulated
versions at carrier locations using sinusoidal demodulation [5],
[10] (Figure 17). Synthesized carriers are multiplied by the
local region followed by low-pass filtering to invoke the
bandlimited constraint along the ν-axis of the envelope terms
in the GCT domain (30). Demodulation is done locally across
time-frequency regions via a least-squared-error fitting
method. The reconstructed spectrogram is combined with the
phase of the original signal to estimate a waveform using
overlap-add. This waveform estimate represents an “upper
limit” of reconstruction due to inclusion of the phase of the
original signal.
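The core of sinusoidal demodulation (multiply by the carrier, then low-pass) can be illustrated in one dimension. The sketch below is a simplified stand-in, not the least-squared-error procedure of [5] and [10]: it assumes a single known carrier with an integer period, a raised-cosine envelope, and a one-period moving average as the low-pass filter.

```python
import math

L_REGION = 256   # samples in the local region; assumed
PERIOD = 8       # carrier period in samples; assumed
NU0 = 2 * math.pi / PERIOD
GAIN = 1.0       # carrier amplitude; assumed

# Slowly varying envelope, bandlimited well below the carrier frequency.
env = [0.5 - 0.5 * math.cos(2 * math.pi * n / L_REGION) for n in range(L_REGION)]
# Modulated copy of the envelope (the region after the near-DC term is removed).
modulated = [GAIN * env[n] * math.cos(NU0 * n) for n in range(L_REGION)]


def demodulate(x, nu, period):
    """Multiply by 2*cos(nu*n), then low-pass with a one-period moving average."""
    shifted = [2.0 * x[n] * math.cos(nu * n) for n in range(len(x))]
    half = period // 2
    out = []
    for n in range(len(x)):
        lo, hi = max(0, n - half), min(len(x), n + half)
        out.append(sum(shifted[lo:hi]) / (hi - lo))
    return out


recovered = demodulate(modulated, NU0, PERIOD)
interior = range(PERIOD, L_REGION - PERIOD)
max_err = max(abs(recovered[n] - GAIN * env[n]) for n in interior)
```

Multiplying by the carrier shifts one envelope copy to DC and another to twice the carrier frequency; the moving average removes the latter. The small residual error comes from the envelope not being perfectly constant over one averaging window, which is a mild form of the bandlimiting effect discussed in the text.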

To obtain carrier parameters for voiced speech, we use the
pitch mapping (31) in conjunction with prior pitch
information. In contrast to the narrowband model, note that a
direct mapping forces all carriers to be located on the ν-axis.
For unvoiced speech and in the bootstrapping method to be
subsequently discussed, peak-picking is done using a
multi-peak picker similar to that of [10]. The GCT magnitude is
analyzed by a series of binary masks to extract peak locations
based on a point’s neighbors and amplitude thresholding.
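A peak picker driven by a point's neighbors and an amplitude threshold could look like the sketch below. This is an assumed, simplified variant (strict maximum over the 8-neighborhood plus a threshold relative to the global maximum); the binary-mask formulation of [10] is not reproduced here, and the function name and threshold value are illustrative.

```python
def pick_peaks(mag, rel_threshold=0.2):
    """Return (row, col) of strict local maxima above rel_threshold * global max."""
    rows, cols = len(mag), len(mag[0])
    floor = rel_threshold * max(max(row) for row in mag)
    peaks = []
    for i in range(rows):
        for j in range(cols):
            value = mag[i][j]
            if value < floor:
                continue  # amplitude thresholding
            neighbors = [mag[a][b]
                         for a in range(max(0, i - 1), min(rows, i + 2))
                         for b in range(max(0, j - 1), min(cols, j + 2))
                         if (a, b) != (i, j)]
            if all(value > other for other in neighbors):
                peaks.append((i, j))
    return peaks


# Toy GCT magnitude with two clear peaks and one weak bump at (4, 0).
toy_gct = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 1.0, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.6, 0.1],
    [0.05, 0.0, 0.0, 0.1, 0.0],
]
peaks = pick_peaks(toy_gct)
```

The weak bump is a local maximum but falls below the threshold, so only the two dominant peaks are returned.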

Carrier assignments for demodulation are made for voiced
speech using a direct method with mapped locations. In the
bootstrapping method, directly mapped
carrier locations are reassigned to those obtained from
peak-picking using a minimal distance criterion in an iterative
algorithm (see Section III.C of [10]). Noise carriers are
assigned based on peak-picking in both direct and
bootstrapping approaches.
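The reassignment step can be pictured with a greedy nearest-candidate rule, sketched below. This is an assumption standing in for the iterative minimal-distance algorithm of [10], whose details are not given here; the function name and the example carrier values are hypothetical.

```python
def reassign_carriers(mapped, candidates):
    """Greedily move each mapped location to its nearest unused candidate peak."""
    remaining = list(candidates)
    reassigned = []
    for location in mapped:
        if not remaining:
            reassigned.append(location)  # no candidate left: keep the mapped value
            continue
        nearest = min(remaining, key=lambda c: abs(c - location))
        reassigned.append(nearest)
        remaining.remove(nearest)
    return reassigned


# Hypothetical carrier positions along the nu-axis, in units of pi.
mapped_nu = [0.10, 0.20, 0.30]            # from a direct pitch mapping
candidate_nu = [0.11, 0.33, 0.19, 0.45]   # from a peak picker
reassigned_nu = reassign_carriers(mapped_nu, candidate_nu)
```

Each candidate is used at most once, so two mapped carriers cannot collapse onto the same peak; the unused candidate at 0.45 is simply left over.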

B.  Co-channel  Speaker  Separation 

As mentioned in the previous section, one motivation for
analysis/synthesis by recovering the near-DC terms from
their modulated versions is the separation (or removal) of
interfering speakers. Specifically, we assume according to our
model that near-DC terms of multiple speakers overlap, while
carrier terms often do not, and that recovery of the
(uncorrupted) DC region must be consistent with modulation
of the carriers.

WGCT-Approach: In our WGCT-based approach, we apply
least-squared-error demodulation using the sum of two
modulation models to fit local time-frequency regions of the
mixture spectrogram; as in [10], [5], this framework utilizes a
sum-of-magnitudes approximation to the mixture spectrogram.
Diagonal loading of the resulting least-squares matrix was
performed using a threshold value obtained from a held-out
development set [10]. Carrier parameters are obtained as in
the single-speaker case using direct mapping and
peak-picking. Permutations of mixture voicing conditions are used
to assign carriers to distinct speakers for demodulation [10].
In the voiced on voiced case, the pitch mapping of (31) is used
to obtain carrier positions that are used directly or as reference
values for bootstrapping reassignment as in the single-speaker
case using candidates from the peak-picker. In the voiced on
unvoiced case, the direct pitch mapping is used to obtain the
voiced speaker’s carriers while the unvoiced speaker is
assigned to carrier locations from peak-picking. In
bootstrapping, the voiced speaker’s carriers are first
reassigned while the remaining candidate carrier locations
from peak-picking are assigned to the unvoiced speaker.
Finally, in the unvoiced on unvoiced case, carrier positions
from peak-picking are used to fit the local region; the resulting
estimate is halved and assigned to both speakers. A
distinction of the WGCT approach from the NGCT is that we
apply bootstrapping of the carrier positions as an alternative
method to the direct approach instead of the
exclusion/re-estimation method described in [10].
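A least-squares fit with diagonal loading can be sketched for the smallest case of two unknown gains, one per speaker. The example below is an assumption-laden toy: the basis vectors, the loading value, and the unconditional way the loading is applied are illustrative and are not taken from [10].

```python
def fit_two_gains(basis_a, basis_b, target, loading=0.0):
    """Minimize ||g_a*basis_a + g_b*basis_b - target||^2 + loading*(g_a^2 + g_b^2)."""
    aa = sum(u * u for u in basis_a) + loading   # loaded diagonal entries
    bb = sum(v * v for v in basis_b) + loading
    ab = sum(u * v for u, v in zip(basis_a, basis_b))
    at = sum(u * t for u, t in zip(basis_a, target))
    bt = sum(v * t for v, t in zip(basis_b, target))
    det = aa * bb - ab * ab
    if det == 0.0:
        raise ValueError("normal equations are singular; increase the loading")
    # Cramer's rule on the 2 x 2 normal equations.
    return (at * bb - ab * bt) / det, (aa * bt - ab * at) / det


# Hypothetical per-speaker components of a local region (flattened).
speaker_a = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
speaker_b = [0.0, 1.0, 1.0, 0.0, 0.0, 1.0]
mixture = [2.0 * a + 3.0 * b for a, b in zip(speaker_a, speaker_b)]

gains_plain = fit_two_gains(speaker_a, speaker_b, mixture)
gains_loaded = fit_two_gains(speaker_a, speaker_b, mixture, loading=1.0)
```

Without loading the fit recovers the true gains exactly; with loading the gains shrink toward zero, which trades bias for stability when the two speakers' components are nearly collinear.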


Figure 18. (a) Local region of wideband spectrogram for voiced speaker 1
(red lines, shaded blue) and voiced speaker 2 (purple lines, shaded yellow)
mixture; (b) corresponding WGCT with removal of near-DC terms and
demodulation to extract speaker 1; (c) voiced speaker 1 (red lines, blue
shaded) and unvoiced speaker 2 (black squares, yellow shaded) mixture;
(d) WGCT of (c) indicating removal of near-DC terms and demodulation
to recover speaker 2. Note that demodulation in (b) and (d) is illustrated
for the direct approach, though this is done similarly in bootstrapping.


female  (FF),  and  64  female-male  (FM)  mixtures,  all  mixed  at 
0 dB overall signal-to-signal ratio. The selected pairs cover a
large  range  of  overlapping  voiced  and  unvoiced  conditions, 
including  crossing  pitch  tracks.  Pitch  tracks  for  individual 
utterances  are  obtained  using  the  Wavesurfer  package  [16]. 

B.  Spectrogram  Analysis/Synthesis 

Wideband spectrograms (s_full[n, ω]) are computed as in
Section III. GCT analysis is done using a 2-D 512-point DFT
on local time-frequency regions of size 500 Hz by 37.5 ms
extracted using a 2-D Hamming window (overlap factor 4). A
high-pass (low-pass) 1-D filter h_hp[n] (h_lp[n]) of order 80 is
designed using the frequency sampling method
with pass-band (stop-band) beginning at

$$\frac{60 \cdot 2\pi\,(0.625\times 10^{-3} \times 16000)}{16000} = 0.075\pi \tag{44}$$

corresponding to an extremal low-pitch case of 60 Hz from
(31) (Section II.C) with stop-band (pass-band) roll-off to ν_b.
h_hp[n] is applied to s_full[n, ω] to obtain s_full,hp[n, ω].
s_full,hp[n, ω] is multiplied by a set of sinusoidal carriers
followed by low-pass filtering by h_lp[n] to obtain envelope
estimates, which are used to fit gain parameters in a
least-squares formulation. Note that in demodulation, we used
s_full,hp[n, ω] instead of s_full[n, ω]; this was observed in
preliminary experiments to reduce the influence of cross terms
near the WGCT origin after demodulation, such as in the case of
low-pitch values (e.g., for males).
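The cutoff arithmetic of (44) is easy to verify. The helper below simply mirrors that arithmetic (pitch times 2π, scaled by the frame step in samples over the sampling rate); treating it as a general pitch-to-carrier mapping is an assumption, since (31) itself is not reproduced in this section, and the function name is illustrative.

```python
import math


def carrier_position(pitch_hz, frame_step_s, fs_hz):
    """Normalized frequency (radians per frame) for a pitch, per the arithmetic of (44)."""
    hop_samples = frame_step_s * fs_hz   # frame step expressed in samples
    return 2.0 * math.pi * pitch_hz * hop_samples / fs_hz


nu_low = carrier_position(60.0, 0.625e-3, 16000.0)    # extremal low-pitch case
nu_200 = carrier_position(200.0, 0.625e-3, 16000.0)   # e.g., a 200-Hz pitch
```

With these parameters the 60-Hz case evaluates to 0.075π, so under this mapping carriers for any pitch above 60 Hz fall inside the high-pass pass-band.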

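As metrics, we compute root-mean-squared errors (RMSE)

$$\mathrm{RMSE} = \sqrt{\frac{1}{\omega N}\sum_{n,\omega}\left(s'_{\mathrm{full}}[n,\omega] - \hat{s}'_{\mathrm{full}}[n,\omega]\right)^2} \tag{45}$$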


Fusion Methods: From Section VI, recall that the number of
harmonic components in the GCT depends on the short-time
analysis window. For instance, male speakers with low pitch
exhibit fewer terms in the NGCT than females while the
opposite is true for the WGCT; this effect was suggested in [10]
as contributing to differences in performance in both
analysis/synthesis and separation. It is conceivable that a
fusion of separation estimates from both the NGCT and
WGCT can lead to better overall estimates. We consider a
simple fusion method using a weighted sum (here, 0 ≤ α ≤ 1)

$$\hat{x}_{\mathrm{fused}}[n] = \alpha\,\hat{x}_{\mathrm{narrow}}[n] + (1-\alpha)\,\hat{x}_{\mathrm{wide}}[n]. \tag{43}$$
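The weighted-sum fusion and a grid search for its weight can be sketched as follows. The helper names, the toy signals, and the use of a plain energy-ratio SNR are assumptions for illustration; the sweep over 101 grid points mirrors a 0.01 step size.

```python
import math


def snr_db(reference, estimate):
    """10*log10 of reference energy over error energy (inf when the error is zero)."""
    err = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(sum(r * r for r in reference) / err)


def fuse(narrow, wide, alpha):
    """Weighted sum alpha*narrow + (1 - alpha)*wide, sample by sample."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(narrow, wide)]


def sweep_alpha(triples, steps=100):
    """Pick alpha in {0, 1/steps, ..., 1} with the highest average SNR.

    triples holds (reference, narrow_estimate, wide_estimate) waveforms.
    """
    def mean_snr(alpha):
        scores = [snr_db(ref, fuse(nar, wid, alpha)) for ref, nar, wid in triples]
        return sum(scores) / len(scores)
    return max((k / steps for k in range(steps + 1)), key=mean_snr)


# Toy development "set": the two estimates err on different samples.
truth = [1.0, 2.0, 3.0, 4.0]
narrow_est = [1.3, 2.0, 3.0, 4.0]   # larger error
wide_est = [1.0, 2.1, 3.0, 4.0]     # smaller error
best_alpha = sweep_alpha([(truth, narrow_est, wide_est)])
```

In this toy case the selected weight is 0.1, so the estimate with the smaller error receives most of the weight while the other still contributes; the fused SNR exceeds that of either estimate alone because their errors do not overlap.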

VIII. Evaluation

A.  Data  Set 

In  spectrogram  analysis/synthesis  and  speaker  separation,  we 
use data from TIMIT [15] identical to that in [10]. For
analysis/synthesis,  10  males  and  10  females  speaking  2 
distinct  utterances  are  used  for  a  total  of  40  examples.  In 
separation,  the  development  set  consists  of  5  male-male 
(MM),  female-male  (FM),  and  female-female  (FF)  mixtures 
while  the  test  set  consists  of  24  male-male  (MM),  24  female- 


where ωN denotes the total number of DFT frequency bins in
the spectrogram and s'_full[n, ω] and ŝ'_full[n, ω] are the
original and reconstruction, respectively, normalized to have a
maximum value of unity. In addition, we compute the
signal-to-noise ratio (SNR)

$$\mathrm{SNR} = 10\log_{10}\left(\frac{\sum_n x[n]^2}{\sum_n \left(x[n] - \hat{x}_{\mathrm{single}}[n]\right)^2}\right) \tag{46}$$

where x[n] is the original waveform and x̂_single[n] is the waveform
estimate obtained by combining ŝ_full[n, ω] with the phase of the
original signal [10], [12].
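The normalized RMSE described above can be sketched as below. The function name and the nested-list spectrogram layout are assumptions; each spectrogram is scaled to a maximum of one before the error is averaged over all time-frequency bins.

```python
import math


def normalized_rmse(reference, estimate):
    """RMSE between two spectrograms (lists of rows), each scaled to a maximum of one."""
    ref_max = max(max(row) for row in reference)
    est_max = max(max(row) for row in estimate)
    total, count = 0.0, 0
    for ref_row, est_row in zip(reference, estimate):
        for r, e in zip(ref_row, est_row):
            total += (r / ref_max - e / est_max) ** 2
            count += 1
    return math.sqrt(total / count)


reference = [[0.0, 2.0], [4.0, 2.0]]
scaled_copy = [[0.0, 1.0], [2.0, 1.0]]   # same shape, half the gain
distorted = [[0.0, 1.0], [2.0, 2.0]]     # one bin differs after normalization

rmse_scaled = normalized_rmse(reference, scaled_copy)
rmse_distorted = normalized_rmse(reference, distorted)
```

Because of the normalization, a pure gain change yields zero error, so the metric reflects differences in spectrogram shape rather than overall level.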

Figure 22 shows results of a single female utterance. For
display purposes, we display the reference and reconstructed
spectrograms after taking them to the power of 0.5; for
reference, we also show an “error” spectrogram computed as
the squared difference between the bootstrap and true
spectrograms after normalization. One limitation of the
demodulation approach (in both bootstrapping and direct
methods) is a “smoothing” effect on onset/offset structure,
presumably due to bandlimiting of the envelope term in the
proposed modulation model (Figure 22, time 750). In addition,
both methods fail to capture aperiodic content such as at time
500, as may be due to glottalization [12]. For voiced speech,
the “enforcement” of periodic carriers and their use as guides
for reassignment in bootstrapping are evidently insufficient to
fully address these effects. Overall, however, the
reconstruction results demonstrate that speech content is
generally well-represented by the modulation model with
error values ranging from 7e-3 to 4e-2 on a scale of unity as
the maximum value. Quantitatively, bootstrapping appears to
modestly outperform the direct method (Table 2) using the
RMSE metric. Nonetheless, this is not reflected in the
resulting waveforms in SNR, presumably due to phase effects
in reconstruction. In informal listening, (non-author) subjects
did not distinguish waveform reconstructions between
bootstrapping, direct, and the original.

Table 2. Average RMSEs and SNRs for analysis/synthesis of spectrograms;
standard errors in [].

                     Direct               Bootstrapping
RMSE (Males)         4.35e-2 [9.54e-3]    6.80e-3 [5.01e-3]
RMSE (Females)       3.32e-2 [6.09e-3]    7.92e-3 [5.44e-4]
SNR (dB) (Males)     24.61 [0.46]         22.15 [0.18]
SNR (dB) (Females)   21.89 [0.39]         23.04 [0.43]

C.  Co-channel  Speaker  Separation 

In speaker separation, the mixed signal x_mix[n] is analyzed
with short-time and GCT parameters identical to those in
analysis/synthesis. We compute RMSEs in the
spectrogram estimate as in analysis/synthesis for applications
such as pre-processing for speech recognition. For human
listening, reconstructed spectrograms are combined with the
phase of the mixed signal to obtain a waveform estimate. We
denote waveform estimates as x̂_i[n], and we compute the
signal-to-interferer ratio [10]

$$\mathrm{SNR}_i = 10\log_{10}\left(\frac{\sum_n x_i[n]^2}{\sum_n \left(x_i[n] - \hat{x}_i[n]\right)^2}\right) \tag{47}$$

where x_i[n] is the original (unmixed) utterance. In fusion, the
α parameter was swept on the development set from 0 through
1 with a step size of 0.01; we used the exclusion method from
[10] and the bootstrap method for the narrowband and wideband
estimates, respectively. The α value corresponding to the highest
average SNR across all waveforms was used in testing. We obtained a
“best” value of α = 0.56 to be applied in testing.

Figures 23 and 24 show the results of wideband-based
speaker separation. Demodulation is capable of suppressing
harmonic content from an interferer (e.g., time 750 (1200) in
Figure 23 (24)), thereby leading to separation of speakers. A
limitation in separation can be observed in Figure 24 (time
~1200), where the onset of the target is poorly replicated in the
estimate. As in analysis/synthesis, this is likely due to
bandlimiting of the envelope term in demodulation.
Quantitatively, separation can result in RMSEs on the order of
3e-2 (on a scale of unity as the maximum value) and 4-6 dB
global SNR gains across all permutations of mixtures (Table
3, Table 4). In general, bootstrapping appears to provide
modest gains over the direct method. In our fusion results, we
obtain global SNR gains up to ~1 dB over either narrow or
wideband estimates alone. In Figure 19, the narrowband
estimate provides better estimates overall; however, the
wideband estimate provides complementary information in
better suppressing content from an interferer at time ~2.6e4.


In  informal  listening,  (non-author)  subjects  reported 
suppression  of  interfering  speakers  in  all  conditions  with 
faithful  reconstructions  of  the  target.  Fused  estimates  were 
reported  to  exhibit  less  “abrupt”  insertions  of  interfering 
speakers,  consistent  with  the  overall  gains  observed  in  SNR. 


Figure  19.  (a)  Fusion  estimate  and  truth  target  utterance  “appetite”;  (b) 
narrowband  estimate  of  target;  (c)  mixture  waveform  of  two  females 
(“Neither  his  appetite,  his  exacerbations,  nor  his  despair  were  akin  to 
yours.”  +  “Forty-seven  states  assign  or  provide  vehicles  for  employees 
and state business.”); (d) wideband estimate of target; note suppression in
(b) of outstanding interferer in (d).


IX.  Conclusions 

This  work  has  proposed  a  model  of  speech  signal  content  as 
represented  in  2-D  analysis  of  wideband  spectrograms.  We 
have  validated  the  utility  of  this  model  for  representing  speech 
content  in  both  analysis/synthesis  and  co-channel  speaker 
separation  experiments.  In  conjunction  with  our  previous 
work,  the  model  motivates  a  novel  taxonomy  of  speech  signal 
behavior  in  the  2-D  Grating  Compression  Transform  (GCT) 
that  exhibits  important  distinctions  in  interpretation, 
particularly  in  relation  to  “dual”  behavior. 

One implication of the proposed taxonomy is its potential
for interpreting other time-frequency distributions. For
instance, the auditory spectrogram is generally viewed
as being “narrowband”/“wideband” in its low/high-frequency
regions.  The  periodicity-  and  formant-dependent  carrier 
derived  in  the  current  GCT  framework  may  be  applicable  to 
high-frequency  regions,  thereby  providing  an  explicit 
interpretation  for  modulation  components  observed  in  the 
auditory  spectrogram  in  relation  to  speech  parameters. 

As  suggested  by  our  results  in  speaker  separation,  the  GCT 
may  have  additional  applications  due  to  its  representation  of 
speech  parameters.  For  instance,  modifying  carrier 
components  in  the  WGCT  may  be  used  for  pitch  and/or 
formant  bandwidth  modification  in  voice  transformation.  As 
suggested  in  [10],  the  mapping  of  noise  and  speech  content  in 
distinct  regions  of  the  GCT  space  also  motivates  applicability 
to  speech  enhancement.  Finally,  the  present  speaker 
separation framework may be combined with existing multi-pitch
tracking methods towards a full separation system.


Table 3. Average RMSEs for speaker separation, with standard errors [ ], on test set

              Direct               Bootstrap
MM            3.38e-2 [1.4e-3]     3.28e-2 [1.4e-3]
FF            3.22e-2 [8.96e-3]    3.19e-2 [8.81e-3]
FM - Male     2.77e-2 [8.61e-3]    2.82e-2 [9.29e-3]
FM - Female   3.52e-2 [1e-3]       3.64e-2 [1e-3]

Table 4. Average SNRs (dB) for speaker separation, with standard errors [ ], on test set

              Direct        Bootstrap     Narrow        Fusion
MM            4.42 [0.12]   4.86 [0.15]   3.67 [0.16]   5.43 [0.16]
FF            5.63 [0.18]   6.02 [0.19]   6.30 [0.14]   6.72 [0.17]
FM - Male     5.46 [0.13]   5.92 [0.14]   4.83 [0.11]   6.51 [0.12]
FM - Female   5.66 [0.14]   5.54 [0.15]   5.71 [0.11]   6.49 [0.12]
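
As a worked check on the fusion result, the short sketch below computes, for each mixture condition, the margin of the fused estimate over the best single estimate in Table 4. It assumes that the Direct and Bootstrap columns are the wideband estimates and that Narrow is the narrowband estimate; the dictionary simply transcribes the table means, and the variable names are illustrative.

```python
# Mean SNRs (dB) transcribed from Table 4: Direct, Bootstrap, Narrow, Fusion.
table4 = {
    "MM":          (4.42, 4.86, 3.67, 5.43),
    "FF":          (5.63, 6.02, 6.30, 6.72),
    "FM - Male":   (5.46, 5.92, 4.83, 6.51),
    "FM - Female": (5.66, 5.54, 5.71, 6.49),
}

fusion_margin = {}
for cond, (direct, bootstrap, narrow, fusion) in table4.items():
    best_single = max(direct, bootstrap, narrow)
    fusion_margin[cond] = round(fusion - best_single, 2)
    print(f"{cond:12s} fusion - best single estimate = {fusion_margin[cond]:.2f} dB")
```

Under these assumptions the margins are 0.57, 0.42, 0.59, and 0.78 dB, i.e., fusion improves on the best single estimate by less than 1 dB in every condition.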

Appendix  I 

Consider a time-varying decaying sinusoid represented by
Green’s function $g[n,m]$, where $m$ is the time of excitation
and $n$ is the time axis along which we observe the resulting
response [12], i.e.,

$$g[n,m] = \xi\, e^{-\int_{m}^{n} \alpha(z)\,dz} \cos\left(\int_{m}^{n} \phi(z)\,dz\right) u[n-m]. \tag{48}$$

$\alpha(z)$ and $\phi(z)$ are integrable functions corresponding to the
instantaneous decay rate and center frequency of the formant,
respectively, and $\xi$ is the initial amplitude of the response.
The output $y[n]$ of $g[n,m]$ excited by $p[n]$ (4) is a
superposition sum [12]

$$y[n] = \sum_{m=-\infty}^{\infty} g[n,m]\, p[m]. \tag{49}$$

Substituting (4) and (48) into (49), we obtain

$$y[n] = \sum_{k=0}^{N_k} \xi\, e^{-\int_{kP}^{n} \alpha(z)\,dz} \cos\left(\int_{kP}^{n} \phi(z)\,dz\right) u[n-kP]. \tag{50}$$
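
The superposition above can be simulated numerically. The following is a minimal sketch, not the authors' implementation: it assumes an impulse train with one impulse every `period` samples starting at n = 0, approximates the integrals of the decay rate and center frequency by cumulative sums, and uses illustrative values for the sampling rate, pitch period, decay rate, and a linearly rising formant frequency.

```python
import numpy as np

def synth_formant_response(n_samples, period, alpha, phi, xi=1.0):
    """Superpose time-varying decaying sinusoids excited every `period` samples.

    alpha, phi: arrays of length n_samples with the instantaneous decay rate
    and center frequency (radians/sample). The integral from the excitation
    time m to n is approximated by a difference of cumulative sums.
    """
    n = np.arange(n_samples)
    cum_alpha = np.concatenate(([0.0], np.cumsum(alpha)))
    cum_phi = np.concatenate(([0.0], np.cumsum(phi)))
    y = np.zeros(n_samples)
    for m in range(0, n_samples, period):       # impulse locations kP
        active = n[m:]                          # u[n - m]: only n >= m contribute
        decay = cum_alpha[active] - cum_alpha[m]
        phase = cum_phi[active] - cum_phi[m]
        y[m:] += xi * np.exp(-decay) * np.cos(phase)
    return y

# Illustrative parameters (assumptions, not values from the paper).
fs = 8000.0                                     # sampling rate in Hz
n_samples, period = 800, 80                     # 100 Hz pitch at 8 kHz
alpha = np.full(n_samples, 0.02)                # constant decay rate
phi = 2 * np.pi * np.linspace(500.0, 700.0, n_samples) / fs  # rising formant
y = synth_formant_response(n_samples, period, alpha, phi)
```

With a single impulse and constant parameters, the output reduces to an ordinary decaying cosine, which is a convenient way to validate the cumulative-sum approximation.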

Let $n_0$ denote the time at which the window is shifted to
extract a segment $y_{n_0}[n]$ of $y[n]$, i.e.,

$$y_{n_0}[n] = w[n-n_0] \sum_{k=0}^{N_k} \xi\, e^{-\int_{kP}^{n} \alpha(z)\,dz} \cos\left(\int_{kP}^{n} \phi(z)\,dz\right) u[n-kP]. \tag{51}$$


Consider $n_0$ in (51) such that the entirety of the window is
located between impulses at $N_k P$ and $(N_k+1)P$. Within
$y_{n_0}[n]$, we assume that the decay rate and frequency of the
sinusoid are constant and a function of the time of analysis $n_0$:

$$y_{n_0}[n] \approx w[n-n_0] \sum_{k=0}^{N_k} \xi\, e^{-\left(\alpha(n_0)\,n + \int_{kP}^{n_0} \alpha(z)\,dz\right)} \cos\left(\phi(n_0)\,n + \int_{kP}^{n_0} \phi(z)\,dz\right) u[n-kP]. \tag{52}$$

This “frozen time” approximation is similar to that assumed in
typical short-time analysis methods (e.g., linear prediction
[17]) to invoke stationarity of speech parameters. The
contribution of the $k$th component in (52), although
time-varying, appears to come from a decaying sine with constant
decay and frequency. Nonetheless, its starting amplitude (of
the decay) and phase (of the sinusoid) will differ as a function
of the distance between $n_0$ and the point of excitation $kP$.

We make a further approximation by assuming that each
contribution to the summation across $k$ in (52) is aligned at
the window onset such that it may be viewed as a scaled and
shifted decaying sinusoid, thereby ignoring effects of the
phase terms $\int_{kP}^{n_0} \phi(z)\,dz$ and temporal overlap, i.e.,

$$y_{n_0}[n] \approx w[n-n_0] \sum_{k=0}^{N_k} e^{-\int_{kP}^{n_0} \alpha(z)\,dz}\, h[n-n_0;\, n_0], \tag{53}$$

$$h[n; n_0] = \xi\, e^{-\alpha(n_0)\,n} \cos\left(\phi(n_0)\,n\right) u[n]. \tag{54}$$

The Fourier transform of (53) and its magnitude are

$$Y(n_0,\omega) \approx \sum_{k=0}^{N_k} e^{-\int_{kP}^{n_0} \alpha(z)\,dz} \left[W(\omega) *_{\omega} H(\omega, n_0)\right] e^{-j\omega n_0}, \tag{55}$$

$$H(\omega, n_0) = \frac{0.5\,\xi}{\alpha(n_0) + j\left(\omega - \phi(n_0)\right)} + \frac{0.5\,\xi}{\alpha(n_0) + j\left(\omega + \phi(n_0)\right)}, \tag{56}$$

$$\left|Y(n_0,\omega)\right|_{\mathrm{local}} \approx w[n_0,\omega]\, E_d[n_0]\, \tilde{H}(n_0,\omega), \tag{57}$$

$$E_d[n_0] = \sum_{k=0}^{N_k} e^{-\int_{kP}^{n_0} \alpha(z)\,dz}, \tag{58}$$

$$\tilde{H}(n_0,\omega) = \left|W(\omega) *_{\omega} H(\omega, n_0)\right|, \tag{59}$$

where $\tilde{H}(n_0,\omega)$ is a smoothed version of the formant and
$E_d[n_0]$ is a time-dependent amplitude term. In (57), we have
added a 2-D window term $w[n_0,\omega]$ to emphasize analysis in a
local time-frequency region.

While $E_d[n_0]$ is not a periodic function in general, it can be
made periodic in $P$ under certain constraints such as $\alpha(z) = \alpha_0$
or $\alpha(z) = \cos\left(\frac{2\pi}{P} z\right)$, corresponding to constant or
sinusoidally-varying decay rates. These conditions therefore
allow for time-varying formants to be represented as a general
time-dependent envelope term in conjunction with a periodic
carrier. For instance, $\alpha(z) = \alpha_0$ reflects a condition of
constant decay but potentially changing formant frequency.
For periodic $E_d[n_0]$, it can be shown that the 2-D Fourier
transform of (57) (i.e., the WGCT) is


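$$Y(\nu,\Omega) = W(\nu,\Omega) *_{\nu,\Omega} \left[\eta(\nu,\Omega) *_{\nu} \left(K\,\delta(\nu) + \sum_{l=1}^{N_l} 0.5\,\beta_l\,\delta\left(\nu \pm \frac{2\pi l}{P}\right)\right)\right] \tag{60}$$

where $\eta(\nu,\Omega)$ is the 2-D Fourier transform of $\tilde{H}(n_0,\omega)$ and $K$,
$\beta_l$, and $N_l$ are parameters of a sinusoidal series.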
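
The periodicity of the amplitude term under a constant decay rate, and the line spectrum that it implies for the carrier, can be checked numerically. The sketch below is illustrative only: it assumes a constant decay rate `alpha0`, impulses at multiples of the pitch period `P` starting at zero, and arbitrary parameter values. It evaluates the amplitude term as a sum of decayed contributions from all past impulses, then verifies that it repeats with period `P` and that its strongest non-DC spectral component sits at the pitch fundamental.

```python
import numpy as np

def envelope_Ed(n0, period, alpha0):
    """Amplitude term for constant decay alpha0: sum over impulses at kP <= n0."""
    k = np.arange(0, n0 // period + 1)
    return float(np.sum(np.exp(-alpha0 * (n0 - k * period))))

P, alpha0 = 80, 0.05               # illustrative pitch period (samples), decay rate
n0 = np.arange(10 * P, 30 * P)     # skip the start-up transient of the sum
Ed = np.array([envelope_Ed(int(n), P, alpha0) for n in n0])

# Periodicity in P: the envelope one period later should match once the
# contributions of the earliest impulses have decayed away.
max_dev = float(np.max(np.abs(Ed[P:] - Ed[:-P])))

# A periodic envelope has a line spectrum; with exactly 20 periods in the
# analysis span, the fundamental falls on DFT bin len(Ed) // P.
spectrum = np.abs(np.fft.rfft(Ed - Ed.mean()))
peak_bin = int(np.argmax(spectrum))
print(max_dev, peak_bin, len(Ed) // P)
```

The remaining spectral lines fall at integer multiples of the fundamental bin, which corresponds to the series of impulses along the temporal-modulation axis in the carrier term.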

Our  discussion  motivates  a  modulation  view  of  the 
wideband  spectrogram  to  include  time-varying  formant 
structure.  Nonetheless,  this  view  holds  only  approximately  in 
time  regions  away  from  excitation  impulse  onsets  due  to  the 
choice  of  the  window  position. 
Figure 22. (a) Original spectrogram of female utterance “You’ll have to try it alone.”; (b) reconstruction using direct method; (c) reconstruction
using bootstrapping method; (d) “Error” spectrogram computed as the square difference between (b) and (a). fceg0.sil878.0.specs.mat

Figure 23. (a) Mixture spectrogram of a male (“They were shattered.”) and female (“Neither his appetite”); (b) true male target; (c) male estimate
using direct method; (d) male estimate using bootstrap method. For display purposes, spectrograms displayed are taken to the 0.5 power.

Figure 24. As in Figure 23 but for two female mixtures (“Oh yes, he talked”, “Anything wrong captain?”) with “talked” target utterance.


